The uncertainty or the variability of the data may be treated by
considering, rather than a single value for each observation, the interval
of values in which it may fall. This paper studies the derivation of basic
descriptive statistics for interval-valued datasets. We propose a
geometrical approach to the determination of summary statistics (central
tendency and dispersion measures) for interval-valued variables.

1 Introduction

In descriptive statistics, summary statistics are used to synthesize a set of 
real observations. They usually involve: 

- a measure of location or central tendency, such as the arithmetic mean, 
median, interquartile mean or midrange, 

- a measure of dispersion like the standard deviation, range, interquartile 
range or absolute deviation. 

In this paper, we focus on obtaining basic descriptive statistics, such as
central tendency and dispersion measures, for interval-valued data. Such
data are often met in practice; they typically reflect the variability
and/or uncertainty that underlie the observed measurement. Interval data
are a special case of 'symbolic data', which also comprise set-valued
categorical and quantitative variables as described, e.g., in Bock and
Diday (2000).

Empirical extensions of summary statistics to the calculation of the mean
and variance have been given for interval-valued data by Bertrand and
Goupil (2000) and for histogram-valued data by Billard and Diday (2003).

In this paper, we propose a geometrical determination of summary
statistics (mean, median, variance, absolute deviation, ...) for
interval-valued variables. This approach mimics the case of real-valued
variables, with the absolute value of the difference between two real
numbers being replaced by a distance between two intervals.

For real-valued variables, a geometrical way of defining a central value
$c$ of a set $\{x_1, x_2, \ldots, x_n\}$ of $n$ real observations is to
choose $c$ as close as possible to all the $x_i$'s. Let us define the
function $S_p$:

$$
S_p(c) = \|x - \mathbf{c}\|_p =
\begin{cases}
\left( \sum_{i=1}^{n} |x_i - c|^p \right)^{1/p} & \text{for } p < \infty, \\
\max_{i=1,\ldots,n} |x_i - c| & \text{for } p = \infty,
\end{cases}
\tag{1}
$$

where $x \in \mathbb{R}^n$ is the vector of the $n$ observations $x_i$,
$\|\cdot\|_p$ is the $L_p$ norm on $\mathbb{R}^n$, and
$\mathbf{c} = c\,\mathbf{1}_n$ with $\mathbf{1}_n$ the vector of $n$ ones.
Then one can use

$$
\hat{c} = \arg\min_{c \in \mathbb{R}} S_p(c)
\tag{2}
$$

as a central value and $S_p(\hat{c})$ as the associated dispersion measure.
The above minimization problem has an explicit solution for
$p = 1, 2, \infty$.

• When $p = 1$, the central value is $\hat{c} = x_M$ (the sample median)
and the corresponding dispersion is
$S_1(x_M) = \sum_{i=1}^{n} |x_i - x_M| = n\, s_M$, where $s_M$ is the
average absolute deviation from the median.

• When $p = 2$, the central value is $\hat{c} = \bar{x}$ (the sample mean)
and the corresponding dispersion is
$S_2(\bar{x}) = \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \sqrt{n-1}\, s$,
where $s$ is the sample standard deviation.

• When $p = \infty$, the central value is $\hat{c} = x_R$ (the midrange)
and the corresponding dispersion is
$S_\infty(x_R) = \max_{i=1,\ldots,n} |x_i - x_R| = \frac{1}{2} w$, where
$w$ is the sample range.

The pairs $(x_M, s_M)$, $(\bar{x}, s^2)$ and $(x_R, w)$ are then consistent
with the use of respectively the $L_1$, $L_2$ and $L_\infty$ norms in the
function $S_p$.

For interval-valued variables, we will use the above geometrical approach
to define coherent measures of central tendency and dispersion of a set
$\{x_1, x_2, \ldots, x_n\}$ of $n$ intervals
$x_i = [a_i, b_i] \in I = \{[a, b] \mid a, b \in \mathbb{R},\ a \le b\}$. A
measure of central tendency is now an interval
$\hat{c} = [\hat\alpha, \hat\beta]$ defined in order to be as close as
possible to all the $x_i$'s. Replacing in (1) the terms $|x_i - c|$ by a
distance $d(x_i, c)$ between two intervals leads to the function $S_p$
defined by:

$$
S_p(c) =
\begin{cases}
\left( \sum_{i=1}^{n} d(x_i, c)^p \right)^{1/p} & \text{for } p < \infty, \\
\max_{i=1,\ldots,n} d(x_i, c) & \text{for } p = \infty.
\end{cases}
\tag{3}
$$

The central interval is then defined by

$$
\hat{c} = \arg\min_{c \in I} S_p(c),
\tag{4}
$$

and the corresponding dispersion measure is $S_p(\hat{c})$.
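As a numerical illustration of the real-valued case (1)-(2) that the
interval construction mimics, the following minimal sketch evaluates $S_p$
and the three closed-form central values. The function names `s_p` and
`central_value` and the toy sample `xs` are assumptions made for this
example; they do not come from the paper.

```python
import statistics

def s_p(xs, c, p):
    """Criterion S_p(c) of equation (1) for real observations xs."""
    devs = [abs(x - c) for x in xs]
    if p == float("inf"):
        return max(devs)
    return sum(d ** p for d in devs) ** (1.0 / p)

def central_value(xs, p):
    """Closed-form minimizer of S_p for p in {1, 2, inf}."""
    if p == 1:
        return statistics.median(xs)          # sample median
    if p == 2:
        return sum(xs) / len(xs)              # sample mean
    if p == float("inf"):
        return (min(xs) + max(xs)) / 2.0      # midrange
    raise ValueError("closed form only for p = 1, 2 or inf")

xs = [1.0, 2.0, 4.0, 7.0, 11.0]               # toy data, not from the paper
for p in (1, 2, float("inf")):
    c_hat = central_value(xs, p)
    print(p, c_hat, s_p(xs, c_hat, p))
```

On this sample the three dispersions agree with $n\,s_M$, $\sqrt{n-1}\,s$
and $w/2$ respectively.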

In the following, after briefly recalling some definitions of distances
between intervals (section 2), we exhibit in section 3 particular cases of
the value $p$ and the distance $d$ for which explicit formulas for the
lower and upper bounds of the central interval $\hat{c}$ have already been
developed. Then we solve in section 4 the case where $p = 2$ and $d$ is the
Hausdorff distance, and we show how the corresponding central interval can
be computed in a finite number of operations proportional to $n^3$. We
generalize all these results to hypercubes in section 5. Finally,
concluding remarks are given in section 6.



2 Distances between intervals 

Many distances between intervals have been proposed. They vary from simple
ones to more elaborate ones. Elaborate distances taking into account both
range and position have been proposed in the framework of symbolic data
analysis (see for instance Chapter 8 and 11.2.2 of Bock and Diday, 2000;
De Carvalho, 1998; Ichino and Yaguchi, 1994). Simple distances commonly
used to compare $x_1 = [a_1, b_1]$ and $x_2 = [a_2, b_2]$ are the $L_p$
distances between:

- the two vectors $(a_1, b_1)^T$ and $(a_2, b_2)^T$ of the lower and upper
bounds,

- or the two vectors $(m_1, l_1)^T$ and $(m_2, l_2)^T$ of the midpoints
$m_i = \frac{a_i + b_i}{2}$ and the half-lengths
$l_i = \frac{b_i - a_i}{2}$.

General distances between sets, like the Hausdorff distance (see Nadler,
1978), can also be used to compare two intervals. In the case of two
intervals $x_1 = [a_1, b_1]$ and $x_2 = [a_2, b_2]$, the Hausdorff distance
simplifies to:

$$
d(x_1, x_2) = \max(|a_1 - a_2|, |b_1 - b_2|).
\tag{5}
$$

By replacing in (5) the lower bound $a_i$ by $(m_i - l_i)$ and the upper
bound $b_i$ by $(m_i + l_i)$, and using the following property, valid for
$x$ and $y$ in $\mathbb{R}$,

$$
\max(|x - y|, |x + y|) = |x| + |y|,
$$

one can show that the Hausdorff distance can be written as:

$$
d([a_1, b_1], [a_2, b_2]) = |m_1 - m_2| + |l_1 - l_2|.
\tag{6}
$$

The Hausdorff distance between intervals thus has the interesting property
of being, at the same time,

- a distance between sets,

- equal to the $L_\infty$ distance between the vectors $(a_1, b_1)^T$ and
$(a_2, b_2)^T$,

- equal to the $L_1$ distance between the vectors $(m_1, l_1)^T$ and
$(m_2, l_2)^T$.
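The equivalence between (5) and (6) can be checked numerically. The sketch
below is only an illustration: the function names and the sample pairs are
assumptions, and intervals are represented as `(lower, upper)` tuples.

```python
def hausdorff_bounds(x1, x2):
    """Equation (5): L_inf distance between the bound vectors."""
    (a1, b1), (a2, b2) = x1, x2
    return max(abs(a1 - a2), abs(b1 - b2))

def hausdorff_mid_half(x1, x2):
    """Equation (6): L_1 distance between (midpoint, half-length) vectors."""
    (a1, b1), (a2, b2) = x1, x2
    m1, l1 = (a1 + b1) / 2.0, (b1 - a1) / 2.0
    m2, l2 = (a2 + b2) / 2.0, (b2 - a2) / 2.0
    return abs(m1 - m2) + abs(l1 - l2)

# Toy pairs of intervals (illustrative values only).
pairs = [
    ((0.0, 2.0), (1.0, 5.0)),
    ((-3.0, 1.0), (-1.0, 0.0)),
    ((2.0, 2.0), (0.0, 6.0)),
]
for x1, x2 in pairs:
    print(x1, x2, hausdorff_bounds(x1, x2), hausdorff_mid_half(x1, x2))
```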
3 Existing results on central intervals

Explicit formulas for the central interval
$\hat{c} = [\hat\alpha, \hat\beta] = \arg\min_{c \in I} S_p(c)$ can be
found in some particular cases. We recall these results, which have already
been obtained and used in previous works (see for instance Chavent and
Lechevallier, 2002; Chavent, 2004; De Carvalho et al., 2006).

3.1 $L_1$ combination of Hausdorff distances

When $p = 1$ and $d$ is the Hausdorff distance, $S_p(c)$ reads:

$$
S_1(c) = \sum_{i=1}^{n} \left( |m_i - \mu| + |l_i - \lambda| \right),
\tag{7}
$$

where $\mu$ and $\lambda$ are the midpoint and the half-length of
$c = [\alpha, \beta]$. Minimization of $S_1(c)$ boils down to the two
minimization problems:

$$
\min_{\mu \in \mathbb{R}} \sum_{i=1}^{n} |m_i - \mu|
\quad \text{and} \quad
\min_{\lambda \in \mathbb{R}} \sum_{i=1}^{n} |l_i - \lambda|.
$$

Theorem 1 In case of an $L_1$ combination of Hausdorff distances, the
midpoint $\hat\mu$ and the half-length $\hat\lambda$ of the central
interval $\hat{c}$ are:

$$
\hat\mu = \mathrm{median}\{m_i \mid i = 1, \ldots, n\}, \qquad
\hat\lambda = \mathrm{median}\{l_i \mid i = 1, \ldots, n\}.
\tag{8}
$$
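Theorem 1 translates directly into a few lines of code. The following is a
minimal sketch under stated assumptions: intervals are `(lower, upper)`
tuples, the function names and data are hypothetical, and for an even
number of intervals `statistics.median` returns the mean of the two middle
values, which is one of several valid minimizers.

```python
import statistics

def central_interval_l1_hausdorff(intervals):
    """Central interval of Theorem 1: medians of midpoints and half-lengths."""
    mids = [(a + b) / 2.0 for a, b in intervals]
    halves = [(b - a) / 2.0 for a, b in intervals]
    mu = statistics.median(mids)
    lam = statistics.median(halves)
    return (mu - lam, mu + lam)

def s1(intervals, c):
    """Criterion S_1(c): sum of Hausdorff distances (5) to c = (alpha, beta)."""
    alpha, beta = c
    return sum(max(abs(a - alpha), abs(b - beta)) for a, b in intervals)

data = [(0.0, 2.0), (1.0, 5.0), (4.0, 5.0)]   # toy intervals
c_hat = central_interval_l1_hausdorff(data)
print(c_hat, s1(data, c_hat))
```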

3.2 $L_\infty$ combination of Hausdorff distances

When $p = \infty$ and $d$ is the Hausdorff distance, $S_p(c)$ reads:

$$
S_\infty(c) = \max_{i=1,\ldots,n} \max\left\{ |a_i - \alpha|, |b_i - \beta| \right\},
$$

i.e.

$$
S_\infty(c) = \max\left\{ \max_{i=1,\ldots,n} |a_i - \alpha|,\ \max_{i=1,\ldots,n} |b_i - \beta| \right\}.
$$

Minimization of $S_\infty(c)$ boils down to the two minimization problems:

$$
\min_{\alpha \in \mathbb{R}} \max_{i=1,\ldots,n} |a_i - \alpha|
\quad \text{and} \quad
\min_{\beta \in \mathbb{R}} \max_{i=1,\ldots,n} |b_i - \beta|.
\tag{9}
$$

Theorem 2 In case of an $L_\infty$ combination of Hausdorff distances, the
lower bound $\hat\alpha$ and the upper bound $\hat\beta$ of the central
interval $\hat{c}$ are:

$$
\hat\alpha = \frac{a_{(n)} + a_{(1)}}{2}, \qquad
\hat\beta = \frac{b_{(n)} + b_{(1)}}{2},
\tag{10}
$$

where $a_{(n)}$ (resp. $b_{(n)}$) is the largest lower bound (resp. upper
bound) and $a_{(1)}$ (resp. $b_{(1)}$) is the smallest lower bound (resp.
upper bound).
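Theorem 2 amounts to taking the midrange of the lower bounds and the
midrange of the upper bounds. A minimal sketch, with hypothetical function
names and toy data, could look as follows.

```python
def central_interval_linf_hausdorff(intervals):
    """Central interval of Theorem 2: midranges of lower and upper bounds."""
    lowers = [a for a, _ in intervals]
    uppers = [b for _, b in intervals]
    alpha = (min(lowers) + max(lowers)) / 2.0
    beta = (min(uppers) + max(uppers)) / 2.0
    return (alpha, beta)

def s_inf(intervals, c):
    """Criterion S_inf(c): largest Hausdorff distance (5) to c."""
    alpha, beta = c
    return max(max(abs(a - alpha), abs(b - beta)) for a, b in intervals)

data = [(0.0, 2.0), (1.0, 5.0), (4.0, 5.0)]   # toy intervals
c_hat = central_interval_linf_hausdorff(data)
print(c_hat, s_inf(data, c_hat))
```

Since $a_i \le b_i$ for every $i$, the midrange of the lower bounds cannot
exceed that of the upper bounds, so the result is always a valid interval.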
3.3 $L_2$ combination of $L_2$ distances

For $p = 2$, an explicit solution is easily defined when $d$ is the $L_2$
distance between either the midpoints and half-lengths of the intervals or
their lower and upper bounds. For instance in the first case, $S_p(c)$
reads:

$$
S_2(c) = \sqrt{\sum_{i=1}^{n} \left( d(x_i, c) \right)^2}
= \sqrt{\sum_{i=1}^{n} \left( |m_i - \mu|^2 + |l_i - \lambda|^2 \right)}.
\tag{11}
$$

Theorem 3 In case of an $L_2$ combination of $L_2$ distances between
midpoints and half-lengths, the midpoint $\hat\mu$ and the half-length
$\hat\lambda$ of the central interval $\hat{c}$ are:

$$
\hat\mu = \frac{1}{n} \sum_{i=1}^{n} m_i
\quad \text{and} \quad
\hat\lambda = \frac{1}{n} \sum_{i=1}^{n} l_i.
$$

In case of an $L_2$ combination of $L_2$ distances between lower and upper
bounds, the lower and upper bounds of the central interval $\hat{c}$ are:

$$
\hat\alpha = \frac{1}{n} \sum_{i=1}^{n} a_i
\quad \text{and} \quad
\hat\beta = \frac{1}{n} \sum_{i=1}^{n} b_i.
$$
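Both variants of Theorem 3 reduce to arithmetic means. The sketch below
(function names and data are assumptions for illustration) computes the two
versions; because averaging is linear, they describe the same interval,
even though the associated dispersion values differ.

```python
def central_interval_l2_mid_half(intervals):
    """Theorem 3, first case: mean midpoint and mean half-length."""
    n = len(intervals)
    mu = sum((a + b) / 2.0 for a, b in intervals) / n
    lam = sum((b - a) / 2.0 for a, b in intervals) / n
    return (mu - lam, mu + lam)

def central_interval_l2_bounds(intervals):
    """Theorem 3, second case: mean lower bound and mean upper bound."""
    n = len(intervals)
    alpha = sum(a for a, _ in intervals) / n
    beta = sum(b for _, b in intervals) / n
    return (alpha, beta)

data = [(0.0, 2.0), (1.0, 5.0), (4.0, 5.0)]   # toy intervals
print(central_interval_l2_mid_half(data))
print(central_interval_l2_bounds(data))
```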



4 Main result 

We study here the case of an $L_2$ combination of Hausdorff distances. When
$p = 2$ and $d$ is the Hausdorff distance, $S_p(c)$ reads:

$$
\left( S_2(c) \right)^2 = \sum_{i=1}^{n} \left( \max(|a_i - \alpha|, |b_i - \beta|) \right)^2.
\tag{12}
$$

Theorem 4 In case of an $L_2$ combination of Hausdorff distances, the
central interval $\hat{c}$ which minimizes (12) can be computed in a finite
number of operations proportional to $n^3$.

Proof: The square is an increasing function over positive numbers, so
formula (12) can be rewritten:

$$
\left( S_2(c) \right)^2 = \sum_{i=1}^{n} \max\left( (a_i - \alpha)^2, (b_i - \beta)^2 \right).
\tag{13}
$$

On the other hand, using midpoints and half-lengths, one obtains:

$$
(a_i - \alpha)^2 - (b_i - \beta)^2 = -4 (m_i - \mu)(l_i - \lambda).
$$

So we see that the maximum in (13) is $(a_i - \alpha)^2$ if
$(m_i - \mu)(l_i - \lambda) \le 0$, and $(b_i - \beta)^2$ if
$(m_i - \mu)(l_i - \lambda) \ge 0$.

Let us denote by $(m_{(1)}, \ldots, m_{(n)})$, resp.
$(l_{(1)}, \ldots, l_{(n)})$, the sample of the midpoints, resp. the
half-lengths, sorted in increasing order. Let us define the intervals:

$$
\begin{aligned}
M_j &= [m_{(j)}, m_{(j+1)}), \quad j = 0, \ldots, n, \\
L_k &= [l_{(k)}, l_{(k+1)}), \quad k = 0, \ldots, n,
\end{aligned}
\tag{14}
$$

with $m_{(0)} = l_{(0)} = -\infty$ and $m_{(n+1)} = l_{(n+1)} = +\infty$.
For all $(\mu, \lambda)$ in any rectangle $Q_{j,k} = M_j \times L_k$, the
product $(m_i - \mu)(l_i - \lambda)$ has a given sign, for each
$i = 1, \ldots, n$. So the formula (13) for $\left( S_2(c) \right)^2$
simplifies over such a rectangle to:

$$
S_{j,k}(c) = \sum_{i \in I_a^{j,k}} (a_i - \alpha)^2 + \sum_{i \in I_b^{j,k}} (b_i - \beta)^2,
\tag{15}
$$

where:

$$
I_a^{j,k} = \left\{ i \in \{1, \ldots, n\} \;\middle|\;
\left( m_i - \frac{m_{(j)} + m_{(j+1)}}{2} \right)
\left( l_i - \frac{l_{(k)} + l_{(k+1)}}{2} \right) < 0 \right\},
\tag{16}
$$

$$
I_b^{j,k} = \left\{ i \in \{1, \ldots, n\} \;\middle|\;
\left( m_i - \frac{m_{(j)} + m_{(j+1)}}{2} \right)
\left( l_i - \frac{l_{(k)} + l_{(k+1)}}{2} \right) > 0 \right\}.
\tag{17}
$$

Hence the minimization of $\left( S_2(c) \right)^2$ over $\mathbb{R}^2$ is
equivalent to the resolution, for $j, k = 0, 1, \ldots, n$, of the
$(n+1)^2$ constrained quadratic problems:

$$
(P_{j,k}) \quad
\begin{cases}
\text{find } (\alpha, \beta) = (\hat\alpha_{j,k}, \hat\beta_{j,k}) \text{ which minimizes } S_{j,k}(\alpha, \beta) \\
\text{under the constraints:} \\
2 m_{(j)} \le \alpha + \beta \le 2 m_{(j+1)} \text{ and } 2 l_{(k)} \le \beta - \alpha \le 2 l_{(k+1)},
\end{cases}
\tag{18}
$$

whose resolution is described in the Appendix. The central interval
$\hat{c} = [\hat\alpha, \hat\beta]$ is then given by:

$$
(\hat\alpha, \hat\beta) = \arg\min_{j,k = 0, 1, \ldots, n} S_{j,k}(\hat\alpha_{j,k}, \hat\beta_{j,k}).
\tag{19}
$$

Because the number of operations in the resolution of each problem (18) is
proportional to $n$, and there are $(n+1)^2$ such problems, the number of
operations for the calculation of $(\hat\alpha, \hat\beta)$ is proportional
to $n^3$.
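The key step of the proof is that the sign of
$(m_i - \mu)(l_i - \lambda)$ tells which of the two squared terms is the
maximum in (13). The sketch below checks this numerically. It is a
simplification of (15)-(17): the sign is evaluated at the point
$(\mu, \lambda)$ itself rather than at the center of its rectangle, which
gives the same split inside a rectangle. Function names and data are
assumptions for illustration.

```python
def s2_squared(intervals, alpha, beta):
    """Equation (13): sum of max((a_i - alpha)^2, (b_i - beta)^2)."""
    return sum(max((a - alpha) ** 2, (b - beta) ** 2) for a, b in intervals)

def s2_squared_by_sign(intervals, alpha, beta):
    """Same quantity, picking each term from the sign of (m_i - mu)(l_i - lam)."""
    mu, lam = (alpha + beta) / 2.0, (beta - alpha) / 2.0
    total = 0.0
    for a, b in intervals:
        m, l = (a + b) / 2.0, (b - a) / 2.0
        if (m - mu) * (l - lam) < 0:
            total += (a - alpha) ** 2    # lower-bound term is the maximum
        else:
            total += (b - beta) ** 2     # upper-bound term is the maximum (or a tie)
    return total

data = [(0.0, 2.0), (1.0, 5.0), (4.0, 5.0)]   # toy intervals
for alpha, beta in [(2.0, 4.0), (1.5, 3.7), (0.0, 6.0), (3.2, 3.9)]:
    print(s2_squared(data, alpha, beta), s2_squared_by_sign(data, alpha, beta))
```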



5 The multidimensional case 

We consider now a set of $n$ $k$-dimensional intervals
$\{\tilde{x}_1, \ldots, \tilde{x}_n\}$ with
$\tilde{x}_i = [\mathbf{a}_i, \mathbf{b}_i]$ and
$\mathbf{a}_i, \mathbf{b}_i \in \mathbb{R}^k$. A $k$-dimensional interval
$\tilde{x}_i$ can also be viewed as a regular hyperparallelepiped
$\tilde{x}_i = \prod_{j=1}^{k} x_i^j$ with $x_i^j = [a_i^j, b_i^j]$, where
$a_i^j$ (resp. $b_i^j$) is the $j$th coordinate of $\mathbf{a}_i$ (resp.
$\mathbf{b}_i$). By misuse of language the $\tilde{x}_i$'s will be called
hypercubes in the rest of the paper.

The above geometrical approach can then be used to define a central
hypercube (also called centrocube or prototype) of a set of $n$ hypercubes
$\{\tilde{x}_1, \ldots, \tilde{x}_n\}$, which is now a $k$-dimensional
interval $\tilde{c} = [\boldsymbol\alpha, \boldsymbol\beta]$ with
$\boldsymbol\alpha$ and $\boldsymbol\beta$ in $\mathbb{R}^k$. Replacing in
(3) the terms $d(x_i, c)$ by a distance $D(\tilde{x}_i, \tilde{c})$ between
two hypercubes leads to the function $S_p$ defined by:

$$
S_p(\tilde{c}) =
\begin{cases}
\left( \sum_{i=1}^{n} D(\tilde{x}_i, \tilde{c})^p \right)^{1/p} & \text{for } p < \infty, \\
\max_{i=1,\ldots,n} D(\tilde{x}_i, \tilde{c}) & \text{for } p = \infty.
\end{cases}
\tag{20}
$$

The centrocube $\hat{\tilde{c}} = [\hat{\boldsymbol\alpha}, \hat{\boldsymbol\beta}]$
is then defined by

$$
\hat{\tilde{c}} = \arg\min_{\tilde{c}} S_p(\tilde{c}).
\tag{21}
$$

There exist many possible distances between hypercubes (see for instance
Bock, 2002). Once again, depending on the distance $D$ and on the value $p$
in $S_p(\tilde{c})$, the centrocube is more or less difficult to calculate.

A first distance $D$ that could be used is the Hausdorff distance between
two hypercubes:

$$
D(\tilde{x}_1, \tilde{x}_2) = \max\left( h(\tilde{x}_1, \tilde{x}_2), h(\tilde{x}_2, \tilde{x}_1) \right)
\tag{22}
$$

with

$$
h(\tilde{x}_1, \tilde{x}_2) = \sup_{\mathbf{a} \in \tilde{x}_1} \inf_{\mathbf{b} \in \tilde{x}_2} \delta(\mathbf{a}, \mathbf{b}),
\tag{23}
$$

where $\delta$ is an arbitrary metric on $\mathbb{R}^k$. We have seen that
in the one-dimensional case, the Hausdorff distance simplifies to (5), but
the calculation of this distance for higher dimensions is more involved and
depends on the choice of the metric $\delta$. If $\delta$ is the Euclidean
metric for instance, there exist algorithms that compute the Hausdorff
distance between two hypercubes in a finite number of steps (see e.g.,
Bock, 2005) but, as far as we know, there exists no algorithm to compute
the centrocube. If $\delta$ is the $L_\infty$ metric, an explicit solution
for the centrocube exists when $p = \infty$ (see Chavent, 2004). In other
cases, the definition of centrocubes for the original Hausdorff distance
between hypercubes remains a subject to investigate.

Another approach, which makes explicit definitions of centrocubes easier
to find, is to use a distance $D$ that is a combination of coordinate-wise
one-dimensional interval distances $d$:

$$
D(\tilde{x}_1, \tilde{x}_2) =
\begin{cases}
\left( \sum_{j=1}^{k} d(x_1^j, x_2^j)^q \right)^{1/q} & \text{for } q < \infty, \\
\max_{j=1,\ldots,k} d(x_1^j, x_2^j) & \text{for } q = \infty.
\end{cases}
\tag{24}
$$

When $p = q$, $\left( S_p(\tilde{c}) \right)^p$ reads:

$$
\left( S_p(\tilde{c}) \right)^p = \sum_{i=1}^{n} \sum_{j=1}^{k} d(x_i^j, c^j)^p.
\tag{25}
$$

Because $d(x_i^j, c^j) \ge 0$, it is sufficient to find for each component
$j$ the central interval $\hat{c}^j$ which minimizes
$\sum_{i=1}^{n} d(x_i^j, c^j)^p$, so that the centrocube is the product of
the central intervals of each variable. The results presented in sections 3
and 4 concerning central intervals can then be applied directly to define
this 'coordinate-wise' centrocube.
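The coordinate-wise construction can be sketched as a loop over variables
that reuses any one-dimensional rule. The example below uses the
median-based rule of Theorem 1 (case $p = q = 1$ with the Hausdorff
distance). The representation of a hypercube as a list of
`(lower, upper)` tuples, the function names and the data are assumptions
made for this illustration.

```python
import statistics

def central_interval_l1_hausdorff(intervals):
    """One-dimensional central interval of Theorem 1."""
    mu = statistics.median([(a + b) / 2.0 for a, b in intervals])
    lam = statistics.median([(b - a) / 2.0 for a, b in intervals])
    return (mu - lam, mu + lam)

def coordinatewise_centrocube(hypercubes, central_interval):
    """Apply a one-dimensional central-interval rule to each coordinate."""
    k = len(hypercubes[0])
    return [central_interval([h[j] for h in hypercubes]) for j in range(k)]

# Three toy hypercubes in dimension k = 2.
cubes = [
    [(0.0, 2.0), (10.0, 12.0)],
    [(1.0, 5.0), (11.0, 11.0)],
    [(4.0, 5.0), (9.0, 15.0)],
]
print(coordinatewise_centrocube(cubes, central_interval_l1_hausdorff))
```

Passing the one-dimensional rule as an argument keeps the loop unchanged
when another pair $(p, d)$ from sections 3 and 4 is used.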
6 Concluding remarks

In this paper, we proposed different solutions for the determination of
central intervals and hypercubes. These results have applications in
clustering. Indeed, the existence of explicit formulas for the computation
of the centrocube is useful in dynamic clustering (see Diday and Simon,
1976), because it ensures that the criterion $S_p$ decreases at each
iteration. 'Coordinate-wise' centrocubes have been defined as prototypes in
several dynamic clustering algorithms for interval data. The
'coordinate-wise' centrocube for $p = q = 1$ is used with the Hausdorff
distance in Chavent and Lechevallier (2002) and with the $L_1$ distance
between the lower and upper bounds in De Souza and De Carvalho (2004). The
case $p = q = 2$ is used by De Carvalho et al. (2006) with the $L_2$
distance between the lower and upper bounds. The algorithm proposed in
section 4 for the determination of the central interval in the case of an
$L_2$ combination of Hausdorff distances gives a solution for the case
$p = q = 2$ and the Hausdorff distance.

Another application of these results concerns data scaling. Dealing with
scalar variables measured on very different scales is already a problem
when comparing two objects globally on all the variables. For instance, the
Euclidean distance, or more generally the $L_q$ distance, will give more
importance to variables with strong dispersion, and the comparison between
objects will only reflect their differences on those variables. A natural
way to avoid this effect is to use a normalized distance. An $L_q$
normalized component-wise distance between hypercubes could then be:

$$
D(\tilde{x}_1, \tilde{x}_2) =
\begin{cases}
\left( \sum_{j=1}^{k} \left( \frac{d(x_1^j, x_2^j)}{S(\hat{c}^j)} \right)^q \right)^{1/q} & \text{for } q < \infty, \\
\max_{j=1,\ldots,k} \frac{d(x_1^j, x_2^j)}{S(\hat{c}^j)} & \text{for } q = \infty,
\end{cases}
\tag{26}
$$

where $S(\hat{c}^j)$ is the dispersion measure associated with the central
interval $\hat{c}^j$ of the $j$th variable. For coherency reasons, it seems
reasonable to use the same exponent ($q = p$):

- to aggregate the intervals in the search for the central interval and the
evaluation of the dispersion for each variable (exponent $p$ in (3)),

- and to evaluate the distance between objects (exponent $q$ in (26)).
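A normalized component-wise distance of this kind can be sketched as
follows, assuming the Hausdorff distance (5) for $d$ and, for coherency
with $q = 1$, the dispersion $S_1$ at the median-based central interval of
Theorem 1. All function names and the toy data are assumptions; the sketch
also assumes that no variable has zero dispersion.

```python
import statistics

def hausdorff(x, y):
    """Equation (5) for two intervals given as (lower, upper)."""
    return max(abs(x[0] - y[0]), abs(x[1] - y[1]))

def dispersion_l1(intervals):
    """S_1 evaluated at the central interval of Theorem 1."""
    mu = statistics.median([(a + b) / 2.0 for a, b in intervals])
    lam = statistics.median([(b - a) / 2.0 for a, b in intervals])
    c = (mu - lam, mu + lam)
    return sum(hausdorff(x, c) for x in intervals)

def normalized_distance(x1, x2, dispersions, q):
    """Component-wise distance with each coordinate scaled by its dispersion."""
    ratios = [hausdorff(u, v) / s for u, v, s in zip(x1, x2, dispersions)]
    if q == float("inf"):
        return max(ratios)
    return sum(r ** q for r in ratios) ** (1.0 / q)

cubes = [
    [(0.0, 2.0), (10.0, 12.0)],
    [(1.0, 5.0), (11.0, 11.0)],
    [(4.0, 5.0), (9.0, 15.0)],
]
disp = [dispersion_l1([h[j] for h in cubes]) for j in range(2)]
print(disp, normalized_distance(cubes[0], cubes[1], disp, 1))
```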

To conclude, a natural extension of these results concerns weighted central
tendency and dispersion measures. This point is currently under
investigation.
Appendix: Resolution of problem $(P_{j,k})$

We describe here the resolution of one of the minimization problems
$(P_{j,k})$ of equation (18). We drop the subscripts $j, k$, and we write
$m_-$ instead of $m_{(j)}$, $m_+$ instead of $m_{(j+1)}$, $l_-$ instead of
$l_{(k)}$ and $l_+$ instead of $l_{(k+1)}$. We use the midpoint and
half-length variables $\mu = (\alpha + \beta)/2$ and
$\lambda = (\beta - \alpha)/2$, and we denote by $Q$ the rectangle

$$
Q = \{ (\mu, \lambda) \text{ such that } m_- \le \mu \le m_+ \text{ and } l_- \le \lambda \le l_+ \}.
\tag{27}
$$

With these notations, the problem to solve is now:

$$
(P) \quad \text{find } (\hat\mu, \hat\lambda) \text{ which minimizes } S(\mu, \lambda) \text{ over } Q,
\tag{28}
$$

where the objective function is:

$$
S(\mu, \lambda) = \sum_{i \in I_a} (a_i - \mu + \lambda)^2 + \sum_{i \in I_b} (b_i - \mu - \lambda)^2,
\tag{29}
$$

with $I_a$ and $I_b$ defined respectively in (16) and (17). This objective
function is convex and quadratic (the level lines of $S$ are, possibly
degenerate, ellipses with axes parallel to the directions $\lambda = \mu$
and $\lambda = -\mu$), and the constraints in (27) are linear, so that the
resolution of $(P)$ is equivalent to that of the associated Kuhn-Tucker
system of necessary conditions.

We describe now the corresponding algorithm. We have eliminated the
consideration of some dead-end cases by taking advantage of the convexity
of the problem: when the solution $(\hat\mu, \hat\lambda)$ of $(P)$ is on
one edge of $Q$ (possibly at a corner of $Q$), the unconstrained minimizer
$(\tilde\mu, \tilde\lambda)$ of $S$ and the center of $Q$ are necessarily
on different sides of the line containing this edge. Hence the edges of $Q$
which can possibly contain the solution $(\hat\mu, \hat\lambda)$ are those
which contain the $L_2$-projection of $(\tilde\mu, \tilde\lambda)$ on $Q$.

We suppose for simplicity that the midpoints and half-lengths of all
intervals are distinct:

$$
\begin{aligned}
m_{(1)} &< m_{(2)} < \ldots < m_{(n)}, \\
l_{(1)} &< l_{(2)} < \ldots < l_{(n)}.
\end{aligned}
\tag{30}
$$

One computes first, in a loop from 1 to $n$ over the samples:

$$
\begin{aligned}
n_a &= \sum_{i \in I_a} 1, & n_b &= \sum_{i \in I_b} 1, \\
A &= \sum_{i \in I_a} a_i, & B &= \sum_{i \in I_b} b_i, \\
A_2 &= \sum_{i \in I_a} a_i^2, & B_2 &= \sum_{i \in I_b} b_i^2,
\end{aligned}
\tag{31}
$$

with the convention that the sum is zero if the set $I_a$ or $I_b$ of
indices is empty. Notice that $n_a$ is the number of indices in $I_a$, and
$n_b$ is the number of indices in $I_b$, so that $n = n_a + n_b$. With
these notations, the gradient of $S$ satisfies:

$$
\frac{1}{2} \nabla S =
\begin{pmatrix}
- \sum_{i \in I_a} (a_i - \mu + \lambda) - \sum_{i \in I_b} (b_i - \mu - \lambda) \\
+ \sum_{i \in I_a} (a_i - \mu + \lambda) - \sum_{i \in I_b} (b_i - \mu - \lambda)
\end{pmatrix}
=
\begin{pmatrix}
- A - B + (n_a + n_b)\mu - (n_a - n_b)\lambda \\
+ A - B - (n_a - n_b)\mu + (n_a + n_b)\lambda
\end{pmatrix}.
\tag{32}
$$
The minimizer $(\hat\mu, \hat\lambda)$ of problem $(P)$ can be computed as
follows:

1. If $n_a = 0$ (a similar reasoning can be done if $n_b = 0$), then the
   function $S$ reduces over $Q$ to:

   $$
   S(\mu, \lambda) = \sum_{i=1}^{n} (b_i - \mu - \lambda)^2,
   $$

   and the level lines of $S$ degenerate to the straight lines
   $\mu + \lambda = \text{constant}$. The unconstrained minimizers
   $(\tilde\mu, \tilde\lambda)$ of $S$ are then on the line:

   $$
   (L) \quad n(\tilde\mu + \tilde\lambda) = B.
   $$

   If the line $(L)$ goes through $Q$, problem $(P)$ has an infinite number
   of solutions, with at least one of them (in general two) being on the
   boundary of $Q$. If $(L)$ does not hit $Q$, the unique solution of $(P)$
   is located at the corner of $Q$ closest to $(L)$. In both cases, $(P)$
   admits at least one solution $(\hat\mu, \hat\lambda)$ on one edge of
   $Q$. If we denote by $Q^*$ the rectangle on the other side of this edge
   (for which $n_a^* = 1 \ne 0$), one sees that
   $(\hat\mu, \hat\lambda) \in Q^*$, so that the minimum $S^*_{\min}$ of
   $S^*$ over $Q^*$ will necessarily be smaller than or equal to
   $S_{\min}$, the minimum of $S$ over $Q$. So there is no point in
   computing $S_{\min}$, and we can skip the resolution of problem $(P)$.

2. If $n_a > 0$ and $n_b > 0$, the unconstrained minimizer
   $(\tilde\mu, \tilde\lambda)$ of $S$ is unique. It is given by:

   $$
   \begin{aligned}
   \sum_{i \in I_a} a_i &= n_a (\tilde\mu - \tilde\lambda) = n_a \tilde\alpha, \\
   \sum_{i \in I_b} b_i &= n_b (\tilde\mu + \tilde\lambda) = n_b \tilde\beta.
   \end{aligned}
   \tag{33}
   $$

   If $(\tilde\mu, \tilde\lambda) \in Q$ then set $\hat\mu = \tilde\mu$,
   $\hat\lambda = \tilde\lambda$, and the problem is solved. If not, go to
   the next step.

3. Compute the $L_2$-projection $(\bar\mu, \bar\lambda)$ of
   $(\tilde\mu, \tilde\lambda)$ on $Q$:

   $$
   \bar\mu =
   \begin{cases}
   m_- & \text{if } \tilde\mu < m_-, \\
   \tilde\mu & \text{if } m_- \le \tilde\mu \le m_+, \\
   m_+ & \text{if } m_+ < \tilde\mu,
   \end{cases}
   \qquad
   \bar\lambda =
   \begin{cases}
   l_- & \text{if } \tilde\lambda < l_-, \\
   \tilde\lambda & \text{if } l_- \le \tilde\lambda \le l_+, \\
   l_+ & \text{if } l_+ < \tilde\lambda.
   \end{cases}
   \tag{34}
   $$

4. If the projection is on an edge of $Q$, say for example
   $\bar\mu = m_-$, $l_- < \bar\lambda < l_+$ (left edge), determine
   $\lambda$ which zeroes the component of $\nabla S$ along this edge (here
   the second component, as the edge is parallel to the second axis
   $\mu = 0$):

   $$
   + \sum_{i \in I_a} a_i - n_a (m_- - \lambda) - \sum_{i \in I_b} b_i + n_b (m_- + \lambda) = 0.
   \tag{35}
   $$

   Then set:

   $$
   \hat\mu = m_-, \qquad
   \hat\lambda =
   \begin{cases}
   l_- & \text{if } \lambda < l_-, \\
   \lambda & \text{if } l_- \le \lambda \le l_+, \\
   l_+ & \text{if } l_+ < \lambda,
   \end{cases}
   \tag{36}
   $$

   and the problem is solved.

5. If the projection is at a corner of $Q$, say for example
   $\bar\mu = m_-$, $\bar\lambda = l_-$ (lower-left corner), evaluate the
   gradient $\nabla S = (g_\mu, g_\lambda)$ at the corner.

   • If $g_\mu \ge 0$ and $g_\lambda \ge 0$, set $\hat\mu = m_-$,
     $\hat\lambda = l_-$, and the problem is solved.

   • If $g_\mu < 0$ and $g_\lambda \ge 0$ (the objective function is
     decreasing when one leaves the lower-left corner to the right on the
     lower edge of $Q$), determine $\mu$ which zeroes the component of
     $\nabla S$ along this edge (here the first component, as the edge is
     parallel to the first axis $\lambda = 0$):

     $$
     - \sum_{i \in I_a} a_i + n_a (\mu - l_-) - \sum_{i \in I_b} b_i + n_b (\mu + l_-) = 0.
     \tag{37}
     $$

     Then set:

     $$
     \hat\mu =
     \begin{cases}
     \mu & \text{if } \mu \le m_+, \\
     m_+ & \text{if } m_+ < \mu,
     \end{cases}
     \qquad \hat\lambda = l_-,
     \tag{38}
     $$

     and the problem is solved.

   • If $g_\mu \ge 0$ and $g_\lambda < 0$, similarly determine $\lambda$
     which zeroes the component of $\nabla S$ along the left edge of $Q$:

     $$
     + \sum_{i \in I_a} a_i - n_a (m_- - \lambda) - \sum_{i \in I_b} b_i + n_b (m_- + \lambda) = 0.
     \tag{39}
     $$

     Then set:

     $$
     \hat\mu = m_-, \qquad
     \hat\lambda =
     \begin{cases}
     \lambda & \text{if } l_- < \lambda \le l_+, \\
     l_+ & \text{if } l_+ < \lambda,
     \end{cases}
     \tag{40}
     $$

     and the problem is solved.

   • The case $g_\mu < 0$ and $g_\lambda < 0$ cannot happen.



The minimum value $S_{\min}$ of $S$ over $Q$ is then:

$$
S_{\min} = A_2 - 2A\hat\alpha + n_a \hat\alpha^2 + B_2 - 2B\hat\beta + n_b \hat\beta^2,
\tag{41}
$$

where $\hat\alpha = \hat\mu - \hat\lambda$ and
$\hat\beta = \hat\mu + \hat\lambda$.
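The overall procedure of section 4 can be sketched in code. The version
below is a simplified variant, not the case analysis described above:
since $S$ is convex on each rectangle, it simply compares the unconstrained
minimizer (33), when it lies in $Q$, with the clamped one-dimensional
minimizers on the four edges. It also restricts the search to bounded
rectangles, which suffices because by (6) each distance
$|m_i - \mu| + |l_i - \lambda|$ can only decrease when $\mu$ and $\lambda$
are clamped to the ranges of the $m_i$ and $l_i$. All function names are
hypothetical, and the per-rectangle sums are recomputed from scratch, so
the cost remains proportional to $n^3$.

```python
def solve_cell(intervals, m_lo, m_hi, l_lo, l_hi):
    """Minimize S of (29) over Q = [m_lo, m_hi] x [l_lo, l_hi].

    Returns (S_min, mu, lam). Index sets follow (16)-(17), using the
    center of the rectangle; ties are put in I_b.
    """
    mc, lc = (m_lo + m_hi) / 2.0, (l_lo + l_hi) / 2.0
    lows, ups = [], []            # a_i for i in I_a, b_i for i in I_b
    for a, b in intervals:
        m, l = (a + b) / 2.0, (b - a) / 2.0
        if (m - mc) * (l - lc) < 0:
            lows.append(a)
        else:
            ups.append(b)
    n_a, n_b = len(lows), len(ups)
    A, B = sum(lows), sum(ups)
    n = n_a + n_b

    def S(mu, lam):
        return (sum((a - mu + lam) ** 2 for a in lows)
                + sum((b - mu - lam) ** 2 for b in ups))

    def clamp(v, lo, hi):
        return min(max(v, lo), hi)

    candidates = []
    if n_a > 0 and n_b > 0:
        alpha, beta = A / n_a, B / n_b          # equation (33)
        mu, lam = (alpha + beta) / 2.0, (beta - alpha) / 2.0
        if m_lo <= mu <= m_hi and l_lo <= lam <= l_hi:
            candidates.append((mu, lam))
    for mu in (m_lo, m_hi):                     # vertical edges: zero g_lambda
        lam = (B - A + (n_a - n_b) * mu) / n
        candidates.append((mu, clamp(lam, l_lo, l_hi)))
    for lam in (l_lo, l_hi):                    # horizontal edges: zero g_mu
        mu = (A + B + (n_a - n_b) * lam) / n
        candidates.append((clamp(mu, m_lo, m_hi), lam))
    return min((S(mu, lam), mu, lam) for mu, lam in candidates)

def central_interval_l2_hausdorff(intervals):
    """Central interval for p = 2 with the Hausdorff distance, and (S_2)^2."""
    mids = sorted({(a + b) / 2.0 for a, b in intervals})
    halves = sorted({(b - a) / 2.0 for a, b in intervals})

    def cells(v):
        # Consecutive breakpoints; a single degenerate cell if only one value.
        return list(zip(v, v[1:])) or [(v[0], v[0])]

    s_min, mu, lam = min(
        solve_cell(intervals, m_lo, m_hi, l_lo, l_hi)
        for m_lo, m_hi in cells(mids)
        for l_lo, l_hi in cells(halves)
    )
    return (mu - lam, mu + lam), s_min

print(central_interval_l2_hausdorff([(0.0, 2.0), (1.0, 5.0), (4.0, 5.0)]))
```

Comparing all edge candidates is less economical than the projection
argument of the appendix, but it avoids the case distinctions and handles
tied midpoints or half-lengths without the distinctness assumption (30).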